Linux System Administration

Fixing Intermittent SSH Connection Closed by Remote Host Dropouts

The Problem: I would SSH into a staging environment, start a prolonged database migration, switch to another terminal, and later find myself disconnected with “Connection closed by remote host”; no warning, and no chance to recover the migration that was midway through.

The Constraints: I was behind CGNAT (Carrier-Grade Network Address Translation) and my ISP aggressively dropped idle connections; the server’s SSH keepalive was left at its default (ClientAliveInterval 0, meaning no probes were sent); and I could not wrap every CI job on the affected box in a terminal multiplexer (screen or tmux), because the jobs were fully automated and expected to keep the same terminal session for the duration of their run.

The Solution: After multiple packet captures and far more frustration than I would like to admit, I found the disconnections always traced back to a few causes: mismatched keepalive timers between client and server, MTU black holes on the VPN link, and a firewall killing idle flows after a fixed interval. The real fix is a layered approach (server-side probes, client-side heartbeats, MTU sanity), which I detail below.

Environment: This was tested on Ubuntu 22.04 LTS with OpenSSH 8.9p1, but all of the concepts apply to any modern Linux distribution.

Quick Summary

  • SSH dropouts are almost never an SSH bug; 99% of the time the cause is a timeout or reset from a middlebox on the network path (router, firewall, NAT device).
  • Set the server’s ClientAliveInterval and ClientAliveCountMax and the client’s ServerAliveInterval so the probe frequency beats the shortest timeout on the path.
  • Enable TCPKeepAlive as an additional layer of protection, but do not rely on it alone.
  • Test for MTU mismatches with ping -M do -s <size> and lower the interface MTU accordingly.
  • Use a terminal multiplexer (tmux or screen) when changes to the daemon configuration are impossible (e.g., CI build servers).

What Didn’t Work For Me

At first, I set ServerAliveInterval 15 in my ~/.ssh/config to keep the connection alive so I could resume my work without fear of being disconnected. It did not work as intended: the corporate firewall still shut down my connection after 600 seconds of idle use.

Next I enabled TCPKeepAlive without touching any kernel parameters, and that did not work either: the kernel does not send its first TCP keepalive probe until two hours have passed.

These attempts left me with one main conclusion: all of the probe options have to be configured together, in a unified fashion, so that your keepalive frequency beats the most aggressive timeout on your network path.

Root Cause Analysis

The “Connection closed by remote host” message is simply a symptom. The TCP RST may originate from the server itself, the client, or an intermediary device. Let’s look at what actually happens when a connection closes.

SSH Session Timeout: The Role of ClientAliveInterval and TCPKeepAlive

OpenSSH provides two mechanisms for keeping a session alive. The first is server-side encrypted keepalives: if the server has not received a packet from the client for ClientAliveInterval seconds, it sends an encrypted probe through the SSH channel. After ClientAliveCountMax unanswered probes, the server terminates the connection. According to the sshd_config manual, these probes travel inside the encrypted channel, so they cannot be spoofed, but they are still subject to packet loss.

A common mistake is setting a short ClientAliveInterval while leaving ClientAliveCountMax at its default of 3: a brief burst of packet loss can then terminate the session prematurely regardless of the quality of the connection. The worst-case detection time is simply ClientAliveInterval × ClientAliveCountMax (e.g., 60 × 5 = 300 seconds), so size that buffer deliberately; I give concrete values for both mechanisms below.

TCP Keepalive and sshd_config Timeouts

TCP keepalive is a kernel-level feature, not an SSH-specific one. The Linux kernel applies default TCP keepalive parameters, documented in the kernel networking docs: tcp_keepalive_time = 7200 seconds, tcp_keepalive_intvl = 75 seconds, and tcp_keepalive_probes = 9. The kernel therefore waits over 2 hours before sending the first keepalive probe, which is far too long for an interactive SSH session.
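To inspect (or temporarily lower) these defaults on a running system, you can query the sysctls directly; a quick sketch, assuming IPv4:

sysctl net.ipv4.tcp_keepalive_time net.ipv4.tcp_keepalive_intvl net.ipv4.tcp_keepalive_probes
# Send the first probe after 5 minutes instead of 2 hours (not persistent across reboots)
sudo sysctl -w net.ipv4.tcp_keepalive_time=300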

When you enable TCPKeepAlive yes in sshd_config, the SSH server asks the kernel to monitor the socket. It’s a decent backstop if the network layer craters, but it won’t fire quickly enough to beat a 10‑minute NAT timeout. I always pair it with SSH‑level keepalives for a fast, reliable heartbeat.

Network Layer Causes: MTU, Packet Loss, Firewall Drops

If your SSH session freezes mid-command (e.g., halfway through a long ls -la listing), you may be hitting an MTU mismatch. The same symptom appears when Path MTU Discovery is broken because a firewall blocks ICMP “fragmentation needed” messages. In that scenario, your SSH server sends a 1500-byte segment with the Don’t Fragment (DF) bit set; when it reaches a VPN tunnel whose effective MTU is 1400, the packet is silently dropped, and neither sender nor receiver is notified. After an extended period without a reply, the connection is eventually torn down with a TCP RST.

Packet loss on an intermediate hop can also swallow keepalive probes. More importantly, firewalls and NAT devices commonly have idle timeouts of 5 to 15 minutes: if no traffic flows for longer than that, the device silently discards the connection state, and the next packet from either side is answered with a RST, killing your session until you reconnect. In my experience, this second situation, idle NAT or firewall state expiry, is the most common reason a remote host drops an SSH connection.

How to Fix SSH Connection Closed by Remote Host Timeout

Let’s fix this. The steps below walk through the changes I made on both the server and the client, along with the commands to verify that each change took effect.

  1. Before making any config changes, create a backup copy of your config file: sudo cp /etc/ssh/sshd_config /etc/ssh/sshd_config.bak. Keep a separate root shell open while you reload (or restart) the sshd service, in case a bad config locks you out.

Adjusting ServerAliveInterval and ClientAliveInterval

  • On the server, edit the file /etc/ssh/sshd_config. Add the following two lines right under the Port directive (they are easy to find this way):
ClientAliveInterval 60
ClientAliveCountMax 5
  • This sends an encrypted keepalive to the client every 60 seconds. If the client fails to respond to five consecutive probes (300 seconds in total), the server discards the session. That tolerance absorbs some packet loss on the client side while still generating traffic frequently enough to keep typical firewall idle timers from firing.
  • Next, verify the syntax of your changes and then apply the changes as follows:
sudo sshd -t
sudo systemctl reload sshd
  • Finally, verify that the sshd daemon is running with the desired settings:
sudo sshd -T | grep -E "clientaliveinterval|clientalivecountmax"
  • These changes are the core of the fix for timeouts initiated from the remote side: by turning the server into an active prober, you eliminate idle-session drops.

Enabling TCPKeepAlive in sshd_config

Once those settings are verified, confirm that the following line is still present in your sshd_config file (it defaults to yes):

TCPKeepAlive yes
  • As documented in the sshd_config manual, TCP keepalives help detect loss of connectivity at the network layer. I recommend leaving it on as an added layer of protection, even though the kernel’s long default timers mean it won’t react quickly enough on its own.
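You can confirm the effective value the same way as the keepalive settings above:

sudo sshd -T | grep tcpkeepalive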

Setting Client‑side Keepalives in ~/.ssh/config

If you can’t change the server configuration (e.g., a locked-down bastion you don’t control), handle it from the client instead by adding this block to your ~/.ssh/config:

Host *
    ServerAliveInterval 30
    ServerAliveCountMax 3

Refer to the ssh_config manual for details: with ServerAliveInterval set, your client sends an encrypted request to the server every 30 seconds and expects a response.

If the client gets no response to 3 consecutive requests (90 seconds total), it disconnects. This leaves enough slack to absorb the occasional lost packet without triggering false-positive disconnects.

To watch the keepalives in action, connect with verbose output:

ssh -v user@host

Review the debug output printed to the console: with the settings above, the client attempts to reach the server every 30 seconds, which keeps the connection open even through an aggressive NAT.
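To confirm the option is actually being applied for a given host, you can also dump the client’s effective configuration without connecting (the -G flag has been available since OpenSSH 6.8):

ssh -G user@host | grep -i serveralive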

Edge Cases and Undocumented Workarounds

The common fixes described above do not always hold up in the real world. Here are the cases that made me pull my hair out.

MTU Mismatch: Diagnosing and Fixing with ping -M do

On one site-to-site VPN, the SSH connection would hang or time out whenever a command produced a wall of output. The culprit was an MTU black hole.

Diagnose this by pinging the remote site with the Don’t Fragment bit set and a payload that fills a full 1500-byte frame:

ping -M do -s 1472 <remote_host>

If pings with a smaller payload (say 1300 bytes) succeed but 1472 fails, you have an MTU mismatch (a 1472-byte payload plus the 8-byte ICMP header and 20-byte IP header makes exactly 1500 bytes). Repeat the test with progressively smaller payloads until a ping succeeds. If 1372 is the largest payload that works, your effective path MTU is 1400 bytes.
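To speed up the search, you can walk down through candidate payload sizes in a loop. A minimal sketch, using a hypothetical remote.example.com as the target:

for size in 1472 1452 1400 1372 1300; do
    # -c 1: single probe, -W 2: two-second timeout, -M do: forbid fragmentation
    if ping -M do -c 1 -W 2 -s "$size" remote.example.com > /dev/null 2>&1; then
        echo "Largest working payload: $size (path MTU = $((size + 28)) bytes)"
        break
    fi
done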

Temporary solution for a broken interface

On Linux, you can temporarily correct the problem using the following command:

sudo ip link set dev eth0 mtu 1400

To verify that the new MTU took effect on the interface:

ip link show eth0 | grep mtu

This fix is not persistent, so for a long-term deployment you will want to include it in your network configuration.
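On Ubuntu 22.04, for example, that means a netplan file. A minimal sketch, assuming the interface is eth0 and configured via DHCP (both are illustrative; adjust to your setup):

network:
  version: 2
  ethernets:
    eth0:
      dhcp4: true
      mtu: 1400

Apply it with sudo netplan apply.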

I have lost count of how many times I have used this “fix” on IPsec tunnels where ICMP was blocked and Path MTU Discovery silently failed.

Undocumented Fallback: Using tmux or screen to Survive Drops

When you have no control over the network path, such as with jump hosts or restrictive client-device policies, run your interactive SSH session under tmux. If the SSH connection dies, you reconnect with a new SSH session and run tmux attach; every process you started is still running inside tmux.

tmux new -s worksession
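After a drop, reconnect and reattach to the same session by name:

ssh user@host
tmux attach -t worksession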

This is not a root-cause fix, but it is the handy safety net every admin has in their back pocket to protect themselves.

Advanced Debugging: Wireshark and tcpdump Analysis

When the hang-ups and drops are random, you need packet-level evidence of what actually happened.

Using tcpdump to Capture SSH Traffic

To capture only traffic on port 22, run the following in a terminal on the server:

sudo tcpdump -i any port 22 -w ssh_drop.pcap

When your SSH session next fails, the evidence is preserved in the capture file you specified (ssh_drop.pcap).

Copy the capture file to your workstation for analysis; scp or rsync over a separate connection works fine.
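For example, assuming the capture was written on the remote host:

scp user@host:ssh_drop.pcap .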

Analyzing TCP RST Packets in Wireshark

After you have captured your SSH session packets, open the file you just created in Wireshark.

In Wireshark, isolate the TCP RST packets with the display filter “tcp.flags.reset == 1” and find the RST that ended your session, then note its source IP address. If the source is your own machine, your client-side ServerAliveCountMax most likely fired. If the RST appears to come from the server but your connection passes through an intermediate device (firewall or NAT box), the RST was probably injected by that device after its idle timeout expired. Refer to the Wireshark TCP Analysis wiki for more detail on sequence numbers and spoofed resets.
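If you prefer to stay on the command line, the same display filter works in tshark (Wireshark’s CLI companion):

tshark -r ssh_drop.pcap -Y "tcp.flags.reset == 1"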

In one case, a stateful firewall was RST-ing my connection after exactly 900 seconds of no traffic. Once the capture revealed that duration, I set the keepalives to 60 seconds and the drops stopped.

Debugging sshd Connection Drops via Logs

During a debugging session, raise the sshd log level by setting this directive in /etc/ssh/sshd_config:

LogLevel DEBUG3

Then restart sshd and watch the log:

tail -f /var/log/auth.log

You will see each keepalive probe, any timeouts, and the final disconnect, which tells you whether ClientAliveCountMax is being exhausted prematurely. When finished, turn the log level back to INFO unless you want to fill your disk with logs.
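To pull just the relevant lines from a busy log, a simple grep helps (the exact message text varies between OpenSSH versions, so treat the pattern as a starting point):

grep -Ei "keepalive|timeout|disconnect" /var/log/auth.log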

Prevention & Best Practices

Hardening Keep‑Alive Settings for All Servers

Roll the same sshd_config keepalive settings out to every server with an automation tool such as Ansible or Chef, rather than hand-editing each box.
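For instance, here is a minimal Ansible sketch; the task is illustrative, not from a published role, and it assumes a corresponding “reload sshd” handler exists:

- name: Enforce SSH keepalive settings
  ansible.builtin.lineinfile:
    path: /etc/ssh/sshd_config
    regexp: '^#?{{ item.key }}\s'
    line: '{{ item.key }} {{ item.value }}'
    validate: '/usr/sbin/sshd -t -f %s'
  loop:
    - { key: 'ClientAliveInterval', value: '60' }
    - { key: 'ClientAliveCountMax', value: '5' }
  notify: reload sshd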

In your golden images, bake in a client-side ServerAliveInterval of 30 seconds as an additional layer of protection. This layered approach kills 95% of SSH timeout tickets before they reach your desk.

Packet Loss Troubleshooting with MTR and Pathping

MTR shows you packet loss at each hop in real time:

mtr -r -c 100 <host>

HOST: myhost                   Loss%   Snt   Last   Avg  Best  Wrst StDev
  5.|-- 203.0.113.17             12.0%   100   12.3  13.1  10.2  45.6   8.1

Watch for sustained packet loss at any hop, since that is what swallows keepalive probes. One caveat: loss that appears at an intermediate hop but vanishes at later hops is usually just ICMP rate-limiting on that router; loss that persists through to the final hop is the real signal.

If a hop shows sustained loss above 12% that carries through to the destination, your connection will drop after only a few missed keepalives. If the offending hop belongs to an upstream provider, your best option is to route around it (for example, via a different VPN endpoint).

Monitoring and Alerting on Connection Failures

To monitor your SSH-based jobs for connection failures, a simple cron job that runs ssh -o ConnectTimeout=5 -o BatchMode=yes against each host and alerts on failure covers the basics. For more sophisticated setups, look at the Prometheus Blackbox Exporter’s TCP probes. In my experience, though, a plain bash script catches real-world SSH drops faster than most full monitoring stacks.
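A minimal sketch of that cron check, assuming a hypothetical user@host target, a configured mail command, and an illustrative alert address:

#!/usr/bin/env bash
# Probe the host non-interactively; BatchMode prevents password prompts.
if ! ssh -o ConnectTimeout=5 -o BatchMode=yes user@host true; then
    echo "SSH check to user@host failed at $(date)" | mail -s "SSH connectivity alert" admin@example.com
fi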

Frequently Asked Questions

Why does my SSH drop exactly every 15 minutes of inactivity?

This is almost certainly a firewall or NAT device on your network path. Many devices ship with a default NAT timeout of 900 seconds (15 minutes): after that much inactivity, the device removes the entry from its NAT table, and the next TCP packet triggers an RST. To prevent the disconnects, set ServerAliveInterval to a value between 30 and 60 seconds on your client.

Can NAT or VPN devices cause SSH disconnects?

Yes. NAT mappings eventually “age out,” and VPNs add MTU problems on top. Keepalives on both the client and server, combined with a corrected MTU, resolve most NAT- and VPN-related drops.

How do I test for an MTU bottleneck that is causing SSH drops?

Run ping -M do -s 1472 <server> as shown above. If it fails while smaller sizes succeed, you have an MTU bottleneck—the path cannot handle the full 1500‑byte frame. Reducing the interface MTU on the tunnel endpoints almost always fixes the hanging.
